Minoo
Erfani Joorabchi,
Mona Erfani Joorabchi,
Dr. Christopher D. Shaw,
Version 5.1 of
Tableau was used for solving Mini Challenge 2. Tableau data visualization software was
a project in Stanford University in 1997 by Chris Stolte and Professor Pat
Hanrahan. They invented a database visualization language called VizQL (Visual
Query Language) by combining a structured query language for databases with a
descriptive language for rendering graphics. Tableau Desktop was released in
2003. With Tableau, analysts can connect to large amounts of data and
auto-generates visualizations including reports, tables, charts and graphs through
a drag-and-drop interface.
Microsoft
Excel was used for combining the measured attributes of all eleven
cities/countries. Again the excel file was inserted to Tableau for further
analysing and comparing cities/countries.
Video:
MC2 Video
(.mp4)
ANSWERS:
MC2.1: Analyze the
records you have been given to characterize the spread of the disease. You should take into consideration symptoms
of the disease, mortality rates, temporal patterns of the onset, peak and
recovery of the disease. Health
officials hope that whatever tools are developed to analyze this data might be
available for the next epidemic outbreak.
They are looking for visualization tools that will save them analysis
time so they can react quickly.
Approach: We loaded two files of each eleven cities/countries, to Tableau. The provided files were in the form of .csv with
varied sizes from 90K records to 7 million records. Using Tableau for
MC2 was efficient regarding the process of analyzing the data as well as the
work needed to be done to enter the data, in particular for the huge datasets
like Karachi and Aleppo. We used the following approach for each city/country
e.g., for Colombia there were Colombia.csv and Colombia-deaths.csv with
attribute ID in both tables as primary key. As mentioned in the beginning of
the video we Left joined the two
files (Colombia.csv on the left side and Colombia-deaths.csv on the right) in
order to save all the attributes of patients and deaths. Therefore we had a
joint table in Tableau for each city/country with ID, Age, Gender, Symptoms,
Admittance date and Death date (Death date had valid value for those patients
who died and null value otherwise). When we connected Tableau to the text file
(because .csv file is read as text file in Tableau), it created the joint table
quickly depending on file size e.g., for huge files it took up to 4 minutes to
connect to the dataset. After the data import stage, we started analyzing the table by dragging and dropping attributes. Similarly, the processing time for huge datasets took sometimes up to 4
minutes.
Findings:
A set of 92 symptoms were identified in the
outbreak. Although not all the patients with these 92 symptoms died, however
all of these 92 symptoms caused large number deaths after exactly 8 days
(Table1). Since the hospitalization time for patients died in the outbreak was
exactly 8 days, we deduced that if a patient in the outbreak could survive more
than 8 days he/she then would recover.
The mortality rates of cities/countries which were
involved in the outbreak was high, Mean = .097, in comparison to the mortality rate
while excluding the patients died in the outbreak, Mean = .0001. Additionally,
having a low standard deviation, SD = .0008, indicated that the mortality rates
of cities/countries which were involved in the outbreak tended to be very close
to the mean (Figure7, Table2).
Patients in the outbreak had hospital admittance
date started April 16th until June 29th with a peak in
May 13th-21st. Similarly, those who died in the outbreak
had hospital admittance date started April 16th-22nd
until Jun 18th-26th with a peak in May 14th-20th.
Their death date started Apr 24th-May 4th until Jun 24th-30th
with a peak in May 22nd-28th in different
cities/countries (Figure2, Table3).
Gender and age were not significant factors
as they distributed normally in all cities/countries for both patients and
deaths (normal curve).
Further
investigation:
There were 1,308-1,310 symptoms in each
city/country. Figure1 showed number of patients and deaths for all symptoms in
A. B.
Figure1.
At first glance, ‘NOSE BLEEDS' had the highest
number of people infected which was almost double the rest of 91 symptoms
(Figure1, red arrows), however further investigation assisted in grouping them
into the following 39 distinct Types of symptoms. Since the numbers of patients
(or deaths) caused by these symptoms were quite in the same range (Figure1), we
believed in this outbreak ‘Type 21’ was the most common symptom, then ‘Type 3’,
5, 33, 31, 28, etc in (Table1, gray cells). Therefore vomiting, diarrhea,
abdominal pain, back pain, fever and nose bleed were among very common
symptoms.
Table1.
Categorizing top 92 symptoms into 39 Types.
Type 92 Symptoms 1 ABD AND BACK PAIN ABDBACK PAIN 2 ABD CRAMPING ABDOMINAL CRAMPING 3 ABD PAIN ABD PAINABD PAIN ABD PAINS ABD PN ABD. PAIN ABD.PAIN ABDOMEN PAIN ABDOMINAL PAIN ABDOMINAL PAINABDOMINAL PAIN 4 ABD PAIN FEVER ABD PAIN, FEVER ABD PAINFEVER 5 ABD PAIN VOMITING ABD PAIN, VOMITING ABD PAIN,VOMITING ABD PAINVOMITING ABDOMINAL PAIN VOMITING ABDOMINAL PAIN, VOMITING VOMITING ABD PAIN VOMITING, ABD PAIN VOMITINGABD PAIN 6 ABD PAIN, NAUSEA 7 ABD PX 8 ABDMNAL PAIN UNSPCF SITE 9 ABDOMINAL PAIN ACUTE 10 ABNORMAL LABS 11 BACK AND LEG PAIN 12 BACK AND NECK PAIN 13 BACK INJ BACK INJURY 14 BACK PAIN BACK PAINBACK PAIN BACK PAINS BACK PN 15 BACK PAIN STRAIN BACK PAINSTRAIN 16 BACK PX 17 BACK SPASMS 18 BACK STRAIN 19 CONJUNCTIVITIS RED 20 DIARRHEA Type 92 Symptoms 21 DIARRHEA AND VOMITING DIARRHEA VOMITING DIARRHEA, VOMITING DIARRHEA,VOMITING VOMITING DIARRHEA VOMITING DIARRHEA VOMITING & DIARRHEA VOMITING AND DIARRHEA VOMITING DIARRHEA VOMITING, DIARRHEA VOMITING,DIARRHEA VOMITINGDIARRHEA 22 DIARRHEA FEVER DIARRHEA, FEVER 23 ENCEPHALITIS 24 FACIAL SWELLING 25 HEAD-FACE-NECK SWELLING 26 HEARING LOSS 27 NOSE 28 NOSE BLEED NOSE BLEEDNOSE BLEED NOSE BLEEDS (almost twice as others) 29 PROTEINURIA 30 TREMORS 31 VOMITING VOMITING ALL DAY VOMITING ALONE VOMITTING VOMTING 32 VOMITING BLURRED VISION 33 VOMITING FEVER VOMITING AND FEVER VOMITING FEVER VOMITING, FEVER VOMITING,FEVER VOMITINGFEVER 34 VOMITING BLOOD VOMITING BLOODVOMITING BLOOD VOMITTING BLOOD 35 VOMITING DIARRHEA FEVER VOMITING, DIARRHEA, FEVER VOMITING,DIARRHEA,FEVER 36 VOMITING RASH 37 VOMITING, NECK PAIN, HEADACHE, BLURRED VISION 38 VOMITING,STOMACH HURTS 39 VOMITINGHEADACHE
Following the lead that these 92 symptoms were
likely to be the symptoms in the outbreak, we filtered the data to visualize
only the patients in the outbreak. One could recognize a peak in May on the
hospital admittance date of patients and deaths in the outbreak
(Figure2.B.Bottom) which slightly affected the hospital admittance date of all
patients (Figure2.B.Top). On the other hand all deaths (Figure2.A.Top) were
influenced hugely by the deaths in the outbreak (Figure2.A.Bottom). The peak
for deaths of the outbreak was late May.
Hospital admittance of deaths in
outbreak; they died in mid May and early June Hospital admittance of patients in outbreak Jump Time Peak Time Peak Time Jump Time
A.
B.
Figure2.
Regarding recovery of the disease, we found out all
the deaths of the outbreak died exactly after 8 days. Figure3.A displayed the
correlation between hospital admittance dates versus death dates of whole
deaths in
A.
B. C.
Figure3.
Deaths in Colombia. A) Hospital admittance date vs. death date. B) Hospital
admittance date for those died in the outbreak vs. death date. C) Hospital
admittance date for those who not in the outbreak vs. death date.
To ensure that the symptoms of the outbreak are only
limited to these 92 symptoms, we checked a hypothesis that these top 92
symptoms could be severe ones, and the next jump to 1,173/1,174 symptoms in
Figure1 could be the moderate ones which didn’t cause major deaths. However,
investigating the rest of symptoms showed that there was no peak or temporal
patterns on hospital admittance for those (Figure4).
Figure4. Patients in Colombia excluding ones with 92
symptoms. Top) Hospital admittance. Bottom) Deaths date.
MC2.2: Compare
the outbreak across cities. Factors to
consider include timing of outbreaks, numbers of people infected and recovery
ability of the individual cities.
Identify any anomalies you found.
Findings:
In an attempt to compare the outbreak across cities,
we divided cities into two groups:
·
First
group was the 9 cities/counties of
·
Second
group was two counties of
Since all the 9 cities/countries in the first group
were similar for different factors, namely patterns of the symptoms in the
outbreak, mortality rates, hospitalization time for deaths, timing of
outbreaks, etc., we believed the cities/countries in first group were mainly
affected by the pandemic and countries in the second group, Thailand and
Turkey, were unaffected by the disease.
Further investigation on the diagrams of date of
deaths and hospital admittance of patients and deaths in the 9 cities/countries
of first group, we identified that Aleppo, Yemen, Karachi, Colombia and Nairobi
were quite identical (had an exact normal curve) while Lebanon, Venezuela, Iran
and Saudi Arabia were also identical (they had still normal curve with a small
curve, belly at the beginning of the normal curve) as it shown in the video. The outbreak started Apr 16
until Jun 29 for the countries in the
first groups. Comparing the timing of outbreaks, we identified the peak of the
disease was a bit varied starting from May 13 to May 21 in different cities/countries
and it was roughly in the following order: Kenya (Nairobi), Syria (Aleppo), Lebanon, Yemen, Pakistan
(Karachi), Saudi Arabia, Venezuela, Colombia and Iran.
Additionally the deaths date in the outbreak started
around [Apr 24-May4] until [Jun 24-30] in the 9 countries. Also there was a
jump in their diagrams which was shown the main start for the deaths in the
outbreak (Figure2.A.Bottom). Both jump time and peak time for deaths date in
the outbreak followed the same order as highlighted above (Figure7, Figure8).
Thus we concluded that
Further
investigation on Turkey and Thailand:
Regarding the symptoms, all the 9 countries in the
first group followed the same pattern for symptoms as shown in Figure1. However
number of patients and deaths of different symptoms in
7 common symptoms in Turkey were 'nose bleeds',
'headache, blurred vision', 'LT hand injury', ' tonsil bleed', 'elevated temp',
'LT arm pain', and 'rash on hands'. Only ‘headache, blurred vision' caused to
death. Similarly in Thailand 6 symptoms of 'LT hand injury', ' tonsil bleed',
'elevated temp', 'headache, blurred vision', ' nose bleeds' and 'broken tooth'
were the highest symptoms for the patients which none of them caused to death.
Although headache, blurred vision, and nose bleeds were in common with the 92
symptoms in the outbreak but not the same specific pattern in the outbreak were
identified in the patients and deaths for
Symptoms
A.
B.
Figure5- Turkey. A) Symptoms which caused death. B)
All the symptoms.
Moreover the
mortality rates in
Additionally the
hospitalization time for deaths in Turkey and Thailand were 4.9 which was
similar to hospitalization time for deaths in the rest of the countries while
excluding deaths in the outbreak (Mean = 5.17). it was also very different from
the hospitalization time for deaths in the outbreak which was exactly 8 days
(Table2).
A. B.
Figure6- Turkey. A.Top) Hospital admittance of
patients (blue) and hospital admittance of deaths (orange). A.Bottom) Hospital
admittance of patients (blue) and hospital admittance of deaths (orange). B) Hospital admittance of
deaths vs. their date of death.
Further
investigation on the outbreak across cities/countries:
The number of
patients and deaths in the outbreak for countries may not be as useful as
mortality rate because we did not have the whole number of population, Figure7
compared number of patients and deaths in the outbreak color coded by their
admittance date. Figure7 also compared the mortality rates of outbreak color
coded by their peak of death time for all the cities/countries. In terms of
mortality rate, we noticed that Syria (Aleppo) and Kenya (Nairobi) had the
highest mortality rate while Saudi Arabia and Lebanon had the lowest. It also
displayed the peak time was started earliest in
Figure7. Numbers of patients of outbreak, numbers of
deaths of outbreak and mortality rates of outbreak.
Although Figure8 compared the jump time and
peak time of deaths date as well as peak time of hospital admittance date for
patients and deaths in the outbreak, they followed almost the same order as mentioned
above: Kenya (Nairobi), Syria (Aleppo), Lebanon, Yemen, Pakistan (Karachi),
Saudi Arabia, Venezuela, Colombia and Iran.
Figure8. Day of Jump time and Peak time for deaths of
outbreak.
Table2 and Table3 displayed more facts and
figures of different attribute in the cities/countries.
Table2.
Hospitalization time for deaths of outbreak and non-outbreak, Mortality Rate of
outbreak and non-outbreak and No. of Symptoms of deaths across
cities/countries.
Country Mean
Hospitalization Time for deaths No. of Symptoms of deaths Outbreak Non-outbreak Outbreak Non-outbreak Mortality Rate #Patients #Deaths Mortality Rate #Patients #Deaths 8 5.412 679 .1080 398,406 43,065 .00100 892,653 894 8 5.378 942 .1090 705,856 76,993 .00109 1,536,792 1,679 8 5.161 397 .0865 84,367 7,304 .00095 356,822 342 8 5.371 298 .0999 74,842 7,479 .00102 226,945 232 8 5.195 1,280 .0967 1,651,586 159,866 .00104 5,503,338 5,739 8 5.042 809 .0844 242,130 20,439 .00100 1,085,432 1,090 8 5.079 187 .1002 36,060 3,616 .00093 107,746 101 8 5.160 544 .0968 162,698 15,751 .00106 542,583 580 8 5.209 449 .0943 121,812 11,496 .00102 421,436 430 - 4.919 85 - - - .00097 88,457 86 - 4.961 255 - - - .00092 303,987 280 Average 8 5.171 - .0973 - - .0001 -- - Table3.
Date of deaths of outbreak and hospitalization time for both patients and
deaths of outbreak. Country Outbreak-
Date of deaths Patient Deaths Start Time Jump Time Peak Time End Time Start Time Peak Time End Time Start Time Peak Time End Time Apr 29 May 4 May 22 Jun 24 Apr 16 May 14 Jun 29 Apr 17 May 14 Jun 18 May 1 May 5 May 23 Jun 25 Apr 16 May 13 Jun 29 Apr 20 May 15 Jun 18 Apr 24 May 6 May 25 Jun 30 Apr 16 May 16 Jun 29 Apr 20 May 17 Jun 18 May 1 May 6 May 24 Jun 27 Apr 16 May 16 Jun 29 Apr 19 May 16 Jun 21 May 2 May 6 May 24 Jun 26 Apr 16 May 17 Jun 29 Apr 17 May 17 Jun 25 May 3 May 7 May 26 Jun 28 Apr 16 May 19 Jun 29 Apr 20 May 18 Jun Apr 25 May 8 May 27 Jun 28 Apr 16 May 19 Jun 29 Apr 21 May 19 Jun 20 May 4 May 9 May 28 Jun 29 Apr 16 May 20 Jun 29 Apr 22 May 20 Jun 23 May 2 May 8 May 27 Jun 29 Apr 16 May 21 Jun 29 Apr 24 May 19 Jun 25 - - - - - - - - - - - - - - - - - - - - Average Apr 16 Jun 29